Hadoop’s Overload Tolerant Design Exacerbates Failure Detection and Recovery
Authors
Abstract
Data processing frameworks like Hadoop need to efficiently address failures, which are common occurrences in today’s large-scale data center environments. Failures have a detrimental effect on the interactions between the framework’s processes. Unfortunately, certain adverse but temporary conditions such as network or machine overload can have a similar effect. Handling this effect without regard to its real underlying cause can lead to a sluggish response to failures. We show that this is the case with Hadoop, which couples failure detection and recovery with overload handling into a conservative design with conservative parameter choices. As a result, Hadoop is often slow in reacting to failures and also exhibits large variations in response time under failure. These findings point to opportunities for future research on cross-layer data processing framework design.
Similar resources
A New Design of Fault Tolerant Comparator
In this paper we present a new design of a fault tolerant comparator with a fault-free hot spare. The aim of this design is to achieve a low time and area overhead in fault tolerant comparators. We use the hot standby technique to keep the system operating normally without interruption, and a dynamic recovery method for fault detection and correction. The circuit is divided into smaller modules...
Analysis of Hadoop’s Performance under Failures
Failures are common in today’s data center environment and can significantly impact the performance of important jobs running on top of large scale computing frameworks. In this paper we analyze Hadoop’s behavior under compute node and process failures. Surprisingly, we find that even a single failure can have a large detrimental effect on job running times. We uncover several important design ...
Fault-tolerant design of the IBM pSeries 690 system using POWER4 processor technology
The POWER4-based p690 systems offer the highest performance of the IBM eServer pSeries line of computers. Within the general-purpose UNIX server market, they also offer the highest levels of concurrent error detection, fault isolation, recovery, and availability. High availability is achieved by minimizing component failure rates through improvements in the base technology, and through design t...
Somersault Software Fault-Tolerance
Keywords: software fault-tolerance, process replication failure masking, continuous availability, topology. The ambition of fault-tolerant systems is to provide application transparent fault-tolerance at the same performance as a non-fault-tolerant system. Somersault is a library for developing distributed fault-tolerant software systems that comes close to achieving both goals. We describe Somersault and...
Fault-Tolerant Wireless Multihop Transmissions with Byzantine Failure Detection
Wireless multihop networks consist of a large number of wireless nodes. Hence, the introduction of failure detection and recovery is mandatory. Until now, various failure detection and recovery methods, such as route switching and multiple route detection, have been proposed based on the assumption of a stop failure model. However, the assumption that failed wireless nodes never transmit any messages is too re...